Time Series Forecasting

Store-Sales Time Series Forecasting from Kaggle.

Ong Zhi Rong Jordan https://example.com/norajones (Spacely Sprockets) https://example.com/spacelysprokets
2022-08-20

Introduction

This study focuses on time-series forecasting for store sales. The data is extracted from Kaggle and was provided by Corporación Favorita, an Ecuadorian company.

Methodology

I will be exploring the modeltime library to conduct nested forecasting for the dataset provided.

Dataset

There are a total of 54 stores and 33 product families in the data. The time series starts on 01 Jan 2013 and ends on 31 Aug 2017. The data is split into train and test sets, and the dates in the test data cover the 15 days after the last date in the training data.

The dataset consists of 6 different worksheets. The description of each worksheet is as follows:

In addition to the data provided through the data set, there are two pointers to be aware of:

Libraries

These are the libraries used for the analysis.

# Load (and install if missing) the packages used in this analysis
pacman::p_load(tidyverse, tidymodels,
               timetk, modeltime, ggstatsplot, lubridate, trelliscopejs, seasonal,
               tsibble, feasts, fable, forecast, psych)

# Read the raw Kaggle files
oil <- read_csv("rawdata/oil.csv")
holiday <- read_csv("rawdata/holidays_events.csv")
test <- read_csv("rawdata/test.csv")
train <- read_csv("rawdata/train.csv")
stores <- read_csv("rawdata/stores.csv")
transacation <- read_csv("rawdata/transactions.csv")
describe(oil)
           vars    n  mean    sd median trimmed  mad   min    max
date          1 1218   NaN    NA     NA     NaN   NA   Inf   -Inf
dcoilwtico    2 1175 67.71 25.63  53.19   66.96 18.7 26.19 110.62
           range skew kurtosis   se
date        -Inf   NA       NA   NA
dcoilwtico 84.43 0.32    -1.61 0.75
# Plot the raw oil price series; missing values appear as gaps in the line
oilplot <- ggplot(oil, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#468499") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "Without Linear Interpolation") +
  theme(axis.title.y = element_text(angle = 0))

oilplot 

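# Convert the oil data to a tsibble indexed by date
ts_oil <- oil %>%
  as_tsibble(index = `date`)

# Interpolate the missing oil prices with forecast::na.interp()
ts_oil$dcoilwtico <- na.interp(ts_oil$dcoilwtico)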


# Plot the oil price series again after interpolation
oilplot <- ggplot(ts_oil, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#D2288A") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "With Linear Interpolation") +
  theme(axis.title.y = element_text(angle = 0))

oilplot 
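
As a rough illustration of what linear interpolation does to missing values, the sketch below fills the gaps in a small made-up price vector with base R's approx(). The vector `price` and its values are hypothetical, and approx() is only a stand-in here: it is an assumption that this mirrors how na.interp() treats a non-seasonal series such as the oil price.

```r
# Hypothetical daily prices with gaps (not taken from oil.csv)
price <- c(NA, 10, NA, NA, 16, 18)
idx   <- seq_along(price)
known <- !is.na(price)

# Linear interpolation between the known points; rule = 2 carries the
# nearest known value outwards for leading or trailing gaps
filled <- approx(x = idx[known], y = price[known], xout = idx, rule = 2)$y

filled
```

In this toy case the two interior gaps become 12 and 14, which lie on the straight line between 10 and 16, and the leading gap takes the first observed value.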

We assume that oil prices do not change during weekends or holidays. Therefore, we use the fill function to fill the NA values on those days with the previous available oil price.

# Add a row for every calendar day, then carry the last known price forward
ts_oil_fill <- ts_oil %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(dcoilwtico)


# Plot the completed daily series
ts_oilfill <- ggplot(ts_oil_fill, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#D2288A") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "With Linear Interpolation and Fill") +
  theme(axis.title.y = element_text(angle = 0))

ts_oilfill
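
To make the weekend fill step concrete, here is a minimal sketch on a two-row data frame, assuming only that tidyr is installed. The data frame `prices` and its values are hypothetical; complete() and fill() are the same tidyr functions used above.

```r
library(tidyr)

# Hypothetical weekday-only prices: a Friday and the following Monday
prices <- data.frame(
  date  = as.Date(c("2017-08-11", "2017-08-14")),
  price = c(48.8, 47.6)
)

# complete() adds the missing calendar days with NA prices
all_days <- complete(prices, date = seq.Date(min(date), max(date), by = "day"))

# fill() carries the last observed price forward into those NA rows
filled <- fill(all_days, price)

filled
```

The Saturday and Sunday rows inherit Friday's price, which is the behaviour the assumption above calls for.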

Relationship between Sales and Oil Prices

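# Aggregate sales across all stores and families into one total per day
train_correlation <- train %>%
  group_by(date) %>%
  summarise(total_sales = sum(sales)) %>%
  ungroup()

# Attach the filled daily oil price to each day's total sales
train_oil <- train_correlation %>%
  left_join(ts_oil_fill, by = "date")

# Scatter plot with a non-parametric correlation test (type = "np")
ggscatterstats(
  data  = train_oil,
  x     = total_sales,
  y     = dcoilwtico,
  xlab  = "Total Sales",
  ylab  = "Daily Oil Price",
  title = "Checking for Correlation between Sales and Oil Prices",
  type  = "np"
)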
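
With type = "np", ggscatterstats() reports a non-parametric correlation, which to my understanding is Spearman's rank correlation. The sketch below computes that coefficient directly with base R's cor() on made-up numbers; `total_sales` and `oil_price` here are hypothetical vectors, not the joined data above.

```r
# Hypothetical daily totals and oil prices (illustrative values only)
total_sales <- c(120, 150, 90, 200, 170)
oil_price   <- c(95, 80, 100, 50, 60)

# Spearman's rho is the Pearson correlation of the ranks, so it
# measures monotonic rather than strictly linear association
rho         <- cor(total_sales, oil_price, method = "spearman")
rho_by_rank <- cor(rank(total_sales), rank(oil_price))

rho
```

Because the toy sales rise exactly as the toy prices fall in rank order, both calculations give the same strongly negative coefficient.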

Product Family against Oil Prices

In this analysis, we examine whether the sales of each product family are associated with oil prices.

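# Aggregate sales across stores into one total per day and product family
train_productfamily <- train %>%
  group_by(date, family) %>%
  summarise(total_sales = sum(sales)) %>%
  ungroup()

# Attach the filled daily oil price to each day-family total
train_PF <- train_productfamily %>%
  left_join(ts_oil_fill, by = "date")

# One scatter plot per product family, arranged in a grid with 8 rows
grouped_ggscatterstats(
  data          = train_PF,
  x             = total_sales,
  y             = dcoilwtico,
  grouping.var  = family,
  xlab          = "Total Sales",
  ylab          = "Daily Oil Price",
  type          = "np",
  plotgrid.args = list(nrow = 8)
)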
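
The grid of plots can be hard to scan across 33 families, so a compact table of one coefficient per family may be a useful companion. The following is a minimal sketch on a hypothetical data frame `toy` with two families, assuming dplyr is installed and using Spearman's correlation as the non-parametric measure; the column names follow train_PF, but the values are made up.

```r
library(dplyr)

# Hypothetical sales for two families over the same four days
toy <- data.frame(
  family      = rep(c("BREAD", "AUTOMOTIVE"), each = 4),
  total_sales = c(10, 12, 15, 19, 5, 4, 6, 3),
  dcoilwtico  = rep(c(100, 90, 70, 50), times = 2)
)

# One Spearman coefficient per product family
family_rho <- toy %>%
  group_by(family) %>%
  summarise(rho = cor(total_sales, dcoilwtico, method = "spearman"))

family_rho
```

Applied to the real data, the same pattern would run on train_PF in place of `toy`, and sorting the result by rho would show which families move most closely with the oil price.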